Power Tools 1993 November




Text File | 1993-05-12 | 7KB | 139 lines

APPENDIX A -- SwitchOver/UX Product Operation

SwitchOver/UX consists of a set of programs running on the primary and the standby hosts. (A host is the collection of programs and data used by a processor.) There are two "daemon" programs which are used to monitor the state of the primary hosts. (A daemon program is one which runs automatically in the background to provide certain system services.) A primary host in a highly available group runs a daemon called heartbeat. This program periodically transmits a message to the standby host over the local area network. The standby host runs a daemon called readpulse, which "listens" for the heartbeat messages.

There are five phases to SwitchOver/UX's operation:

   1) Normal health checking
   2) Fault detection and recovery
   3) Application recovery
   4) Resume processing
   5) Processor repair

Normal Health Checking

During normal operation, each primary in a loosely-coupled processor group sends out a "heartbeat" across the LAN to the standby system, informing it that everything is functioning properly (Figure 1). The standby system "listens" for these heartbeats and, as long as it receives them, continues to run its own applications.

Fault Detection and Recovery

When the standby misses a "heartbeat" from a primary processor, it assumes the primary has failed and begins the recovery process (Figure 2). First, the standby processor locks the root disk of the primary processor. This prevents the primary processor from inadvertently accessing the disks and possibly corrupting data once the standby has assumed the responsibilities of the failed primary. Next, the standby reboots itself using the root of the failed system. When the standby finishes rebooting, it will also have the network address of the failed system.

NOTE: Once the standby initiates the recovery process, it cannot be reversed until the processor recovery is completed.
Also, any disks the standby was accessing prior to recovery will not be accessible until the failed primary is repaired.

Application Recovery

After the standby system has finished assuming the identity of the failed system, the applications need to go through their own recovery routines. Most databases support recovery from a reboot and can automatically bring themselves to a consistent state. Custom applications need to be structured to recover from a reboot. Applications can be automatically restarted through the use of HP-UX's standard initialization and start-up files.

Resume Processing

Users log back into the system using their normal procedure and re-start their individual applications. Users do not need to know that they are running on a different processor (Figure 3).

For most database applications, users may need to check on their last transaction before processing was interrupted. Typically, the current transaction will be lost when processing is interrupted. The amount of data loss will depend upon the individual application and database.

Batch applications which are interrupted will also need to be re-started. Depending upon the recovery capabilities of the application, it will need to be re-started either from the beginning or from a point where the state of the program and data are consistent.

Processor Repair

After the standby has assumed the responsibilities of the failed primary, it becomes the primary and the failed primary is treated as a standby system. The failed system is then repaired using HP's normal repair processes with the assistance of a system administrator or repair technician. Once the system is repaired, it can be brought up as a standby for the other primaries or it can resume being a primary (Figure 4).

In order to return the failed primary to being a primary again, the standby which took over for the primary will need to be shut down so the primary can regain its disks and network address.
This switchover should be scheduled for a time when the system is lightly loaded and users can afford the brief downtime.

Recovery Time

The recovery time for a system is very application dependent. Figure 5 diagrams the recovery process and shows which stages are time dependent upon customer applications. There are three key components to recovery time:

   1) Fault detection
   2) System recovery
   3) Application recovery

Fault Detection

Fault detection time is determined by the frequency of the heartbeats and how many heartbeats the standby will allow to go by before it initiates recovery procedures. These two parameters are definable by the system administrator. Typically this stage of the recovery process will take less than one minute.

System Recovery

During system recovery the standby processor reboots and checks all disk file systems to correct any problems that may have been created when the primary processor failed. The reboot time is dependent upon the class of machine (e.g. 827, 832) and the amount of RAM. The disk checking time depends on how much disk space is used for the file system and the state these files were in at the time of the failure. This component may be the largest part of the recovery time. It may be significantly reduced by using databases and applications which access the disks directly and bypass the HP-UX file system. The most recent versions of industry leading database products typically bypass the file system (using raw disk), which helps to minimize recovery times.

Application Recovery

Once the processor has recovered and the file system is intact, the application needs to perform its own recovery. For databases this would mean rolling transactions back to a known state and then rolling them forward, completing all committed transactions and discarding all incomplete transactions. The time it takes for this stage is entirely dependent upon the application and the transaction rate prior to the failure.
This portion of the recovery time can be minimized by choosing databases and structuring applications to recover quickly from system reboots.
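
Fault Detection Example

The two administrator-defined parameters described under Fault Detection (heartbeat frequency and the number of heartbeats allowed to go by) can be illustrated with a small Python model. This is only a sketch under stated assumptions and not the actual readpulse implementation: the class name PulseMonitor, the field names interval_s and miss_limit, and the sample values are all hypothetical, and the real daemon's timing rules may differ.

```python
from dataclasses import dataclass


@dataclass
class PulseMonitor:
    """Hypothetical model of a standby counting missed heartbeats.

    Not the real readpulse daemon; names and semantics are assumptions.
    """
    interval_s: float          # assumed seconds between heartbeats
    miss_limit: int            # assumed consecutive misses that trigger recovery
    last_beat_s: float = 0.0   # time the most recent heartbeat was received

    def record_heartbeat(self, now_s: float) -> None:
        """Note the arrival time of a heartbeat from the primary."""
        self.last_beat_s = now_s

    def missed(self, now_s: float) -> int:
        """Whole heartbeat intervals elapsed since the last heartbeat."""
        return int((now_s - self.last_beat_s) // self.interval_s)

    def primary_presumed_failed(self, now_s: float) -> bool:
        """True once the number of missed heartbeats reaches the limit."""
        return self.missed(now_s) >= self.miss_limit

    def detection_delay_s(self) -> float:
        """Time from the last received heartbeat until recovery would start."""
        return self.interval_s * self.miss_limit


# Illustrative values only: a heartbeat every 5 seconds, 3 misses allowed.
monitor = PulseMonitor(interval_s=5.0, miss_limit=3)
monitor.record_heartbeat(0.0)
print(monitor.primary_presumed_failed(14.9))  # two intervals missed so far
print(monitor.primary_presumed_failed(15.0))  # third interval has elapsed
print(monitor.detection_delay_s())
```

In this model the detection delay is simply the heartbeat interval multiplied by the miss limit, so the illustrative settings above give 15 seconds, which is in line with the "less than one minute" figure quoted for this stage. A shorter interval or lower limit detects failures sooner, at the cost of a greater chance of starting an irreversible recovery because of a briefly delayed heartbeat.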